Back Propagation:
Simple Example:f ( x , y , z ) = ( x + y ) z f(x,y,z)=(x+y)z f ( x , y , z ) = ( x + y ) z
1.Forward Pass: Compute outputs from left to right(from inputs to outputs)
q = x + y q=x+y q = x + y f = q z f=qz f = q z
2.Backward Pass: Compute derivatives
Want: ∂ f ∂ x , ∂ f ∂ y , ∂ f ∂ z \frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z} ∂ x ∂ f , ∂ y ∂ f , ∂ z ∂ f
Order: ∂ f ∂ f → ∂ f ∂ z → ∂ f ∂ q → ∂ f ∂ y = ∂ f ∂ q ∂ q ∂ y → ∂ f ∂ x = ∂ f ∂ q ∂ q ∂ x \frac{\partial f}{\partial f}\rightarrow \frac{\partial f}{\partial z}\rightarrow \frac{\partial f}{\partial q}\rightarrow \frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial y}\rightarrow \frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x} ∂ f ∂ f → ∂ z ∂ f → ∂ q ∂ f → ∂ y ∂ f = ∂ q ∂ f ∂ y ∂ q → ∂ x ∂ f = ∂ q ∂ f ∂ x ∂ q
[Downstream] = [Local]*[Upstream]
sigmoid function:σ ( x ) = 1 1 + e − x \sigma(x)=\frac{1}{1+e^{-x}} σ ( x ) = 1 + e − x 1
d σ ( x ) d x = ( 1 − σ ( x ) ) σ ( x ) \frac{d\sigma(x)}{dx}=(1-\sigma(x))\sigma(x) d x d σ ( x ) = ( 1 − σ ( x )) σ ( x )
add: don’t change derivatives so the elements linked by the “+” share the same gradient.
copy: one side’s derivatives add to the other side.
multiply: swap multiplier
max: gradient router,that reduces one element to 0 0 0 and another,full gradient the same with other side’s element’s.
Flat: “Reverse Thinking Method”:
#Forward
def f (w0,x0,w1,x1,w2):
s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = sigmoid(s3)
#Backward
grad_L = 1.0
grad_s3 = grad_L * ( 1 - L) * L #sigmoid function has a special form of derivatives
grad_w2 = grad_s3 #gradient copier
grad_s2 = grad_s3 #gradient copier
grad_s0 = grad_s2 #gradient copier
grad_s1 = grad_s2 #gradient copier
grad_w1 = grad_s1 * x1 #gradient multiplier
grad_x1 = grad_s1 * w1 #gradient multiplier
grad_w0 = grad_s0 * x0 #gradient multiplier
grad_x0 = grad_w0 * w0 #gradient multiplier
y ∈ R , x ∈ R M , d y d x ∈ R M y \in \mathbb{R},x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^M y ∈ R , x ∈ R M , d x d y ∈ R M
y ∈ R N , x ∈ R M , d y d x ∈ R M × N y \in \mathbb{R}^N,x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^{M\times N} y ∈ R N , x ∈ R M , d x d y ∈ R M × N :Jacobian Matrix
4D INPUT x x x :[1,-2,3,-3] → f ( x ) = max ( 0 , x ) ( e l e m e n t w i s e ) → \rightarrow f(x)=\max(0,x)(elementwise)\rightarrow → f ( x ) = max ( 0 , x ) ( e l e m e n tw i se ) → 4D OUTPUT y y y = [1,0,3,0]
4D d L d y \frac{dL}{dy} d y d L :[4,-1,5,9] → \rightarrow → d y d x d L d y \frac{dy}{dx}\frac{dL}{dy} d x d y d y d L
d y d x \frac{dy}{dx} d x d y =[ 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ] \begin{bmatrix}1&0&0&0 \\0&0&0&0 \\0&0&1&0\\0&0&0&0 \end{bmatrix} 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 positive: 1 negative: 0
Jacobian is sparse!: off-diagonal entries all zero! When doing a big Jacobian Matrix Multiply,it’ll cause large resources waste because almost every element is zero!Never explicitly form Jacobian,instead use implicit multiplication.
y = x w y=xw y = x w (x : [ N × D ] x:[N\times D] x : [ N × D ] w : [ D × M ] w:[D\times M] w : [ D × M ] y : [ N × M ] y:[N \times M] y : [ N × M ] )
d L d x i , j = d y d x i , j ⋅ d L d y = w j , : ⋅ d L d y i , : \frac{dL}{dx_{i,j}}=\frac{dy}{dx_{i,j}}\cdot\frac{dL}{dy}=w_{j,:}\cdot\frac{dL}{dy_{i,:}} d x i , j d L = d x i , j d y ⋅ d y d L = w j , : ⋅ d y i , : d L
d L d x = d L d y w T \frac{dL}{dx}=\frac{dL}{dy}w^T d x d L = d y d L w T (How to remember? Use the shape!)
d L d w = x T d L d y \frac{dL}{dw}=x^T\frac{dL}{dy} d w d L = x T d y d L
Hint: ⋅ \cdot ⋅ means inner product while blank space means matrix multiply
Backward-Mode: A vector input and a scalar output
Forward-Mode: A scalar input and a vector output
Compute Higher-Order Derivatives(Cool!):
x 0 − − f 1 − − > x 1 − − f 2 − − > L − − f 2 ′ − − > d L d x 1 − − f 1 ′ − − > d L d x 0 − − ⋅ v − − > d L d x 0 ⋅ v x_0 --f1-->x1--f2-->L--f_2'-->\frac{dL}{dx_1}--f_1'-->\frac{dL}{dx_0}--\cdot v-->\frac{dL}{dx_0}\cdot v x 0 − − f 1 − − > x 1 − − f 2 − − > L − − f 2 ′ − − > d x 1 d L − − f 1 ′ − − > d x 0 d L − − ⋅ v − − > d x 0 d L ⋅ v
we want to calculate ∂ 2 L ∂ x 0 2 \frac{\partial^2L}{\partial x_0^2} ∂ x 0 2 ∂ 2 L then we can calculate∂ 2 L ∂ x 0 2 ⋅ v \frac{\partial^2L}{\partial x_0^2}\cdot v ∂ x 0 2 ∂ 2 L ⋅ v , and surprisingly,∂ 2 L ∂ x 0 2 ⋅ v = ∂ ∂ x 0 [ ∂ L ∂ x 0 ⋅ v ] \frac{\partial^2L}{\partial x_0^2}\cdot v=\frac{\partial}{\partial x_0}[\frac{\partial L}{\partial x_0}\cdot v] ∂ x 0 2 ∂ 2 L ⋅ v = ∂ x 0 ∂ [ ∂ x 0 ∂ L ⋅ v ]
(v v v is independent from x 0 x_0 x 0 )
use backprop we will get the answer(remember: backprop gets the derivatives of output with regard to the input)
再分析:倒着写一遍
参考链接:https://zhuanlan.zhihu.com/p/21407711